Homework 5

Become familiar with Permutation-based Variable Importance (PVI).

The dataset used for the experiments with the explanation libraries is the Heart Attack dataset. Most of the experiments are run with XGBClassifier.

Task 1.

As we can see, the most important feature for XGBClassifier is thall, with a dropout loss of about 0.019. age and sex also score relatively high. All of these features are expected to have a strong impact on a patient's condition.

(figure: PVI plot for XGBClassifier)
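The idea behind PVI can be sketched by hand: permute one feature at a time and record how much the model's loss increases. A minimal illustration on synthetic data (the homework itself uses the Heart Attack dataset and dalex):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

# Synthetic stand-in data; shuffle=False keeps the informative features first.
rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           shuffle=False, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# PVI: permute one column at a time and record the increase in loss
# (dalex reports this as "dropout loss").
base_loss = log_loss(y, model.predict_proba(X)[:, 1])
importance = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the feature-target link
    importance.append(log_loss(y, model.predict_proba(X_perm)[:, 1]) - base_loss)

print(importance)
```

Permuting an uninformative feature barely changes the loss, while permuting an informative one increases it noticeably.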

Task 2.

I trained three additional models to compare the results: RandomForestClassifier, LogisticRegression, and LogisticRegression with features transformed by StandardScaler. The results of the two LogisticRegression models are roughly the same. Among the most important features we find sex and various levels of cp and caa. However, age is not among the most important features (it was for XGBClassifier).

Some similarities can be observed between XGBClassifier and RandomForestClassifier, such as the high importance of thall. Features such as sex and age were marked as much less important.

An interesting but fairly expected difference in feature importance appears between the tree-based models and LogisticRegression. The two model types learn different patterns in the dataset, so importance is naturally distributed differently across the features.



(figures: PVI plots for RandomForestClassifier, LogisticRegression, and LogisticRegression with StandardScaler)
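One practical detail of the scaled variant: wrapping StandardScaler and LogisticRegression in a pipeline lets the permutation be applied to the raw features. A sketch on synthetic data, using sklearn's own `permutation_importance` as an analogue of dalex's `model_parts`:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the homework compares these models on heart.csv.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Scaling inside the pipeline means permutation happens on the raw features.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
result = permutation_importance(pipe, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # one mean importance per feature
```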

Task 3.A

Let's now measure feature importance for XGBClassifier using a different approach, the built-in Gini-based importance. This time, thall again receives the highest score. exng_1, which previously was not among the top features, is now rated as the second most important. sex and age, as almost always, appear near the top.


(figure: Gini-based variable importance for XGBClassifier)
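Unlike PVI, this importance comes from the training procedure itself: it aggregates the impurity reduction each feature contributes across the tree splits. A small sketch on synthetic data, using RandomForestClassifier's equivalent attribute:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the appendix plots model1.feature_importances_.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           shuffle=False, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based (Gini) importances are normalised to sum to 1.
gini = model.feature_importances_
print(gini, gini.sum())
```

Note that, because it is computed from the training splits alone, this measure can favour high-cardinality features and needs no extra predictions, whereas PVI requires re-scoring the model.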

Task 3.B

Once again, the results differ slightly when SHAP importance is used. thall, sex, and age achieve expectedly high scores. oldpeak is rated higher than in any other analysis, which is surprising. To summarise, all of these methods are useful for feature-importance analysis, but choosing the best one can be difficult. Different methods rely on different approaches, so it is important to understand them as well as you can in order to make the right choice for the analysed problem.


(figure: SHAP variable importance for XGBClassifier)
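To see what SHAP's bar plot summarises, it helps that for a linear model with (assumed) independent features the SHAP values have a closed form on the log-odds scale: phi_j = w_j * (x_j - mean(x_j)). A hand-computed sketch on synthetic data (the homework itself uses shap's tree explainer on XGBoost):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and a linear model, where SHAP is exact and cheap.
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Per-sample SHAP values on the log-odds scale (independence assumption).
phi = model.coef_[0] * (X - X.mean(axis=0))
mean_abs_shap = np.abs(phi).mean(axis=0)  # global importance, as in shap.plots.bar
print(mean_abs_shap)
```

The global bar plot is just the mean absolute SHAP value per feature, which is why it can rank features differently from PVI: it averages local attributions rather than measuring loss degradation.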

Appendix

In [ ]:
!pip install pandas
!pip install plotly
!pip install seaborn
!pip install scikit-learn
!pip install xgboost
!pip install imbalanced-learn
!pip install dalex
!pip install shap
!pip install lime
In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
import dalex as dx
import shap
import sklearn
In [3]:
df = pd.read_csv('data/heart.csv')
categorical_cols = ['exng', 'caa', 'cp', 'restecg']

ohe = OneHotEncoder(handle_unknown='ignore', sparse=False, drop='first')
df[categorical_cols] = df[categorical_cols].astype('category')
df_tr = df.copy()

ohe.fit(df_tr[categorical_cols])
df_tr[ohe.get_feature_names_out(categorical_cols)] = ohe.transform(df_tr[categorical_cols])
df_tr.drop(columns=categorical_cols, inplace=True)

X, y = df_tr.drop(columns=['output']), df_tr['output']
In [29]:
model1 = XGBClassifier()
model1.fit(X, y)
Out[29]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)
In [6]:
patients = X.iloc[[14, 108]]  # two patients selected for inspection
print(patients.index)
model1.predict_proba(patients)
Int64Index([14, 108], dtype='int64')
Out[6]:
array([[8.0226064e-03, 9.9197739e-01],
       [4.6610832e-04, 9.9953389e-01]], dtype=float32)
In [30]:
explainer = dx.Explainer(model1, X, y)
Preparation of a new explainer is initiated

  -> data              : 303 rows 19 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 303 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x7f9778dd30a0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.000237, mean = 0.545, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.177, mean = 3.57e-05, max = 0.139
  -> model_info        : package xgboost

A new explainer has been created!
In [31]:
explainer.model_performance()
Out[31]:
recall precision f1 accuracy auc
XGBClassifier 1.0 1.0 1.0 1.0 1.0

1.

In [32]:
pvi = explainer.model_parts(random_state=0)
pvi.result
Out[32]:
variable dropout_loss label
0 _full_model_ 0.000000 XGBClassifier
1 restecg_2 0.000000 XGBClassifier
2 caa_3 0.000000 XGBClassifier
3 caa_4 0.000000 XGBClassifier
4 fbs 0.000000 XGBClassifier
5 cp_1 0.000000 XGBClassifier
6 restecg_1 0.000057 XGBClassifier
7 exng_1 0.000373 XGBClassifier
8 caa_2 0.000523 XGBClassifier
9 cp_2 0.000843 XGBClassifier
10 cp_3 0.001098 XGBClassifier
11 trtbps 0.001230 XGBClassifier
12 slp 0.001410 XGBClassifier
13 caa_1 0.005428 XGBClassifier
14 chol 0.007791 XGBClassifier
15 oldpeak 0.008292 XGBClassifier
16 sex 0.009679 XGBClassifier
17 age 0.009794 XGBClassifier
18 thalachh 0.009851 XGBClassifier
19 thall 0.018819 XGBClassifier
20 _baseline_ 0.492754 XGBClassifier
In [33]:
pvi.plot(show=False).update_layout(autosize=False, width=600, height=450)

2.

In [19]:
model2 = RandomForestClassifier()
model2.fit(X, y)
explainer2 = dx.Explainer(model2, X, y)
explainer2.model_performance()
Preparation of a new explainer is initiated

  -> data              : 303 rows 19 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 303 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x7f9778dd30a0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0, mean = 0.545, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.4, mean = -0.000231, max = 0.38
  -> model_info        : package sklearn

A new explainer has been created!
X does not have valid feature names, but RandomForestClassifier was fitted with feature names
Out[19]:
recall precision f1 accuracy auc
RandomForestClassifier 1.0 1.0 1.0 1.0 1.0
In [20]:
pvi2 = explainer2.model_parts(random_state=0)
pvi2.result
Out[20]:
variable dropout_loss label
0 _full_model_ 0.000000 RandomForestClassifier
1 restecg_2 0.000000 RandomForestClassifier
2 caa_4 0.000000 RandomForestClassifier
3 fbs 0.000000 RandomForestClassifier
4 cp_1 0.000000 RandomForestClassifier
5 restecg_1 0.000040 RandomForestClassifier
6 caa_3 0.000044 RandomForestClassifier
7 cp_3 0.000044 RandomForestClassifier
8 caa_2 0.000316 RandomForestClassifier
9 trtbps 0.000709 RandomForestClassifier
10 cp_2 0.000764 RandomForestClassifier
11 slp 0.000861 RandomForestClassifier
12 chol 0.001168 RandomForestClassifier
13 age 0.001700 RandomForestClassifier
14 thalachh 0.001913 RandomForestClassifier
15 sex 0.002121 RandomForestClassifier
16 exng_1 0.002321 RandomForestClassifier
17 caa_1 0.002804 RandomForestClassifier
18 oldpeak 0.005606 RandomForestClassifier
19 thall 0.013054 RandomForestClassifier
20 _baseline_ 0.490786 RandomForestClassifier
In [35]:
pvi2.plot(show=False).update_layout(autosize=False, width=600, height=450)
In [21]:
model3 = LogisticRegression(max_iter=1000)
model3.fit(X, y)
explainer3 = dx.Explainer(model3, X, y)
explainer3.model_performance()
Preparation of a new explainer is initiated

  -> data              : 303 rows 19 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 303 values
  -> model_class       : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x7f9778dd30a0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.00275, mean = 0.545, max = 0.996
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.952, mean = 3.49e-05, max = 0.975
  -> model_info        : package sklearn

A new explainer has been created!
X does not have valid feature names, but LogisticRegression was fitted with feature names
Out[21]:
recall precision f1 accuracy auc
LogisticRegression 0.927273 0.845304 0.884393 0.867987 0.938735
In [22]:
pvi3 = explainer3.model_parts(random_state=0)
pvi3.result
Out[22]:
variable dropout_loss label
0 fbs 0.060927 LogisticRegression
1 _full_model_ 0.061265 LogisticRegression
2 restecg_2 0.061282 LogisticRegression
3 caa_4 0.061546 LogisticRegression
4 age 0.062161 LogisticRegression
5 restecg_1 0.063443 LogisticRegression
6 chol 0.064335 LogisticRegression
7 cp_1 0.067154 LogisticRegression
8 caa_3 0.067532 LogisticRegression
9 trtbps 0.070338 LogisticRegression
10 slp 0.070369 LogisticRegression
11 exng_1 0.070597 LogisticRegression
12 oldpeak 0.075094 LogisticRegression
13 cp_3 0.076658 LogisticRegression
14 thall 0.078085 LogisticRegression
15 thalachh 0.083469 LogisticRegression
16 caa_2 0.089820 LogisticRegression
17 caa_1 0.090259 LogisticRegression
18 cp_2 0.092881 LogisticRegression
19 sex 0.099631 LogisticRegression
20 _baseline_ 0.493412 LogisticRegression
In [36]:
pvi3.plot(show=False).update_layout(autosize=False, width=600, height=450)
In [27]:
model4 = LogisticRegression(max_iter=1000)
sc = StandardScaler()
X_tr = pd.DataFrame(sc.fit_transform(X), index=X.index, columns=X.columns)
model4.fit(X_tr, y)
explainer4 = dx.Explainer(model4, X_tr, y)
explainer4.model_performance()
Preparation of a new explainer is initiated

  -> data              : 303 rows 19 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 303 values
  -> model_class       : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x7f9778dd30a0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.000817, mean = 0.545, max = 0.998
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.972, mean = 6.39e-07, max = 0.987
  -> model_info        : package sklearn

A new explainer has been created!
X does not have valid feature names, but LogisticRegression was fitted with feature names
Out[27]:
recall precision f1 accuracy auc
LogisticRegression 0.921212 0.853933 0.886297 0.871287 0.940492
In [28]:
pvi4 = explainer4.model_parts(random_state=0)
pvi4.result
Out[28]:
variable dropout_loss label
0 _full_model_ 0.059508 LogisticRegression
1 restecg_2 0.059688 LogisticRegression
2 fbs 0.059741 LogisticRegression
3 caa_4 0.060492 LogisticRegression
4 age 0.060804 LogisticRegression
5 restecg_1 0.061919 LogisticRegression
6 chol 0.062600 LogisticRegression
7 cp_1 0.067628 LogisticRegression
8 exng_1 0.068256 LogisticRegression
9 oldpeak 0.068959 LogisticRegression
10 trtbps 0.069381 LogisticRegression
11 caa_3 0.070044 LogisticRegression
12 slp 0.074739 LogisticRegression
13 thall 0.074765 LogisticRegression
14 thalachh 0.075736 LogisticRegression
15 cp_3 0.078243 LogisticRegression
16 caa_1 0.093808 LogisticRegression
17 cp_2 0.093939 LogisticRegression
18 caa_2 0.097255 LogisticRegression
19 sex 0.099073 LogisticRegression
20 _baseline_ 0.492565 LogisticRegression
In [37]:
pvi4.plot(show=False).update_layout(autosize=False, width=600, height=450)

3.A

In [41]:
import plotly.express as px

px.bar(pd.DataFrame({'variable': X.columns, 'importance': model1.feature_importances_}), x='variable', y='importance', title='Gini-based Variable Importance')

3.B

In [54]:
explainer_shap = shap.explainers.Tree(model1, X, model_output='probability')
shap.plots.bar(explainer_shap(X))